This submission is our work alone and complies with the 30535 integrity policy.
Add your initials to indicate your agreement: HJ, SN
Add names of anyone you discussed this problem set with: **__**
Late coins used this pset: 2. Late coins left: 1.
Prelim questions
A: Yes.
filter() to reduce the amount of data you pull while exploring data. For example, you can filter by time and location to only get data for a small part of the city and/or over a short time period.collect() a small sample data set so that the you have data in memory on yourcollect() the entire data set each time you want to work with it.A. (a) will not cause problems, becuase if you are only exploring data and you know enough about the fields and possible values that you can make knowledgeable filters and not mess anything up, then filtering is a good way to do minor data exploration quickly without waiting to make full dataset pulls. However this is obviously is only good for explorations, not solutions, as the question indicates, and you are still going back to the database each time with your new filter, whereas it could be easier to cut the datafile size down another way. In that case, collect() could be useful, where you could test certain relationships between key variables without having to draw on the whole dataset. In (c), the problem here is not the limitation of the data obviously, since you have it all, but pulling it each time, such as with this applied problem set, takes a while and is often not feasible. So out of any of these, the third is the least feasible, while the first two at least have limited usefulness.
2.. As is the case with any data set, Waze has to make decisions about what data to store and how to measure it. Review the data documentation and the rest of the problem set. Propose a variable that Waze could feasibly track that is not available now or feasible and better way a to measure a variable currently in the dataset. Support your proposal with reasoning.
A: I think it would be interesting to see if they started recording the number of incidences within a specific geographic “zone” (i.e. a traffic light just before an on-ramp, a bridge where there are a lot of rear-end collisions, etc.), aggregating these into a total, and then forming that total into a sort of “score” field that could be read as the “danger” score of a any specific geographic location. As it stands, I don’t see anything specifically related to the history of a specific location or specific type of location, and I think this could be an interesting field to have because of the potential research purposes that would be made a lot easier, such as identifying the pros and cons of different street configurations, and right now there’s nothing really like that in the data, not unless you make some calculations on your own, which don’t get you far.
A: There are many ways self-selection could influence the data. One is that Waze users are highly focused in certain professions like Uber drivers, which could change the road coverage of the app, leaving certain roads that are not along major “tourist” pathways predictably underserved, such as residential areas. So Waze might simply not work in residential areas. On the other hand, Waze users went through the trouble of downloading the app and going through the trouble of being engaged users are probably young and technologically fluent, and so coule be predicated to be less accident prone and more engage in the road than senior drivers, which could influence the data by making it more reliable than you would otherwise predict from a random sampling of US drivers.
#Corridor 7 = N Western from George to Chicago; x = 41.934212; x = 41.895840
corridor_seven_a <- chicago_data %>%
filter(str_detect(street,"N Western") == TRUE) %>%
arrange(location_y)
corridor_seven_b <- corridor_seven_a %>%
filter(location_y <= 41.934212 & location_y >= 41.895840)
bbox <- c(left = -87.73,
bottom = 41.89,
right = -87.65,
top = 41.937)
corridor_map <- get_stamenmap(bbox, zoom = 14)
## Source : http://tile.stamen.com/terrain/14/4199/6085.png
## Source : http://tile.stamen.com/terrain/14/4200/6085.png
## Source : http://tile.stamen.com/terrain/14/4201/6085.png
## Source : http://tile.stamen.com/terrain/14/4202/6085.png
## Source : http://tile.stamen.com/terrain/14/4199/6086.png
## Source : http://tile.stamen.com/terrain/14/4200/6086.png
## Source : http://tile.stamen.com/terrain/14/4201/6086.png
## Source : http://tile.stamen.com/terrain/14/4202/6086.png
## Source : http://tile.stamen.com/terrain/14/4199/6087.png
## Source : http://tile.stamen.com/terrain/14/4200/6087.png
## Source : http://tile.stamen.com/terrain/14/4201/6087.png
## Source : http://tile.stamen.com/terrain/14/4202/6087.png
## Source : http://tile.stamen.com/terrain/14/4199/6088.png
## Source : http://tile.stamen.com/terrain/14/4200/6088.png
## Source : http://tile.stamen.com/terrain/14/4201/6088.png
## Source : http://tile.stamen.com/terrain/14/4202/6088.png
CorridorSeven <- ggmap(corridor_map)
CorridorSeven +
geom_point(data = corridor_seven_b, aes(x = location_x, y = location_y), alpha = 1/10, position = "jitter")
CorridorSeven +
stat_density2d(
aes(x = location_x, y = location_y, fill = ..level.., alpha = ..level..),
size = 2, bins = 4, data = corridor_seven_b,
geom = "polygon", contour = "TRUE"
)
ggplot(data = corridor_seven_b, aes(x = location_y)) +
geom_bar(binwidth = .0001)
## Warning: `geom_bar()` no longer has a `binwidth` parameter. Please use
## `geom_histogram()` instead.
ggplot(data = corridor_seven_b, aes(x = location_y)) +
geom_bar() + xlim(c(41.9065, 41.908))
## Warning: Removed 16532 rows containing non-finite values (stat_count).
ggmap(corridor_map) +
geom_bin2d(data = corridor_seven_b, aes(x = location_x, y = location_y))
A: Accidents seem most common just south of the intersection of Western and North Avenue, around the latitude (location_x) x = 41.9068, which is right next to an intersection of two busy streets plus a nearby hospital, perhaps bringing in a lot of ambulances onto an already busy area. It is also near an intersection where a diagonal street intersects the street grid, which is conducive to crashes if I remember correctly.
On October 21, the City of Chicago declared the 79 and 66 bus routes as areas of focus for transit oriented development. The City says the plan addresses bus “slow zones”. Note: Watch out for “179th St”.
#79th Street Bus:
#79th/Lake Shore, -87.539729, 41.751907
#79th/Cicero (first western stop actually on 79th), 41.749274, -87.741271
bus79 <- chicago_data %>%
filter(street == "W 79th St" | street == "E 79th St") %>%
filter(type == "JAM" | type == "ACCIDENT") %>%
arrange(location_x)
bus79 <- bus79 %>%
filter(location_x >= -87.741271 & location_x <= -87.539729)
#66th Street Bus:
#Chicago/Austin Terminal, 41.899321, -87.774772
#215 E Chicago Ave (first eastern stop on Chicago), 41.896650, -87.620773
bus66 <- chicago_data %>%
filter(street == "W Chicago Ave") %>%
filter(type == "JAM" | type == "ACCIDENT") %>%
arrange(location_x)
bus66 <- bus66 %>%
filter(location_x >= -87.774772 & location_x <= -87.620773)
##idea: on x axis, plot time, binned by 15 minute intervals
##need: extract time from the scrape time
bus79 <- bus79 %>%
mutate(
time = chron(times = substr(scrape_dt, 12, 19))
)
ggplot(data = bus79, aes(x = time)) +
geom_bar() +
scale_x_time("time") +
ggtitle("Bus 79")
bus66 <- bus66 %>%
mutate(
time = chron(times = substr(scrape_dt, 12, 19))
)
ggplot(data = bus66, aes(x = time)) +
geom_bar() +
scale_x_time("time") +
ggtitle("Bus 66")
b. Using a reasoned approach, choose two additional corridors for comparison.
i. What corridors did you choose and why?
ii. Make comparison plots.
##idea: redo above section, w/ two new inputs
##need: select two geocoordinates off of google maps
#4 Cottage Grove Bus:
#Michigan/Madison, 41.882300, -87.624235
#3000 S Michigan, 41.841611, -87.623177
bus4 <- chicago_data %>%
filter(street == "S Michigan Ave") %>%
filter(type == "JAM" | type == "ACCIDENT") %>%
filter(location_y >= 41.841611 & location_y <= 41.882300)
bus4 <- bus4 %>%
mutate(
time = chron(times = substr(scrape_dt, 12, 19))
)
ggplot(data = bus4, aes(x = time)) +
geom_bar() +
scale_x_time("time") +
ggtitle("Bus 4")
#8 Halsted Bus:
#Halsted and Monroe, 41.880616, -87.647048
#Halsted & 79th, 41.750765, -87.644087
bus8 <- chicago_data %>%
filter(str_detect(street, "S Halsted")) %>%
filter(type == "JAM" | type == "ACCIDENT") %>%
filter(location_y >= 41.750765 & location_y <= 41.880616)
bus8 <- bus8 %>%
mutate(
time = chron(times = substr(scrape_dt, 12, 19))
)
ggplot(data = bus8, aes(x = time)) +
geom_bar() +
scale_x_time("time") +
ggtitle("Bus 8")
A: The approach: the site activetrans.org identifies two additional bus routes, #4 cottage grove, and #8 halsted, as being one of the busiest routes in Chicago while also showing a dip in ridership in recent years, showing space for improvement.
c. Looking beyond traffic, what other alerts are very common in this area?
Do you think these alerts would slow down the 66 / 79? If so, what steps
could the City take to address the issues?
##idea: drop traffic and take frequency
##need: filter and create table
#79th Street Bus:
#79th/Lake Shore, -87.539729, 41.751907
#79th/Cicero (first western stop actually on 79th), 41.749274, -87.741271
bus79_full <- chicago_data %>%
filter(street == "W 79th St" | street == "E 79th St") %>%
filter(type != "JAM" & type != "ACCIDENT") %>%
arrange(location_x)
bus79_full <- bus79_full %>%
filter(location_x >= -87.741271 & location_x <= -87.539729)
#66th Street Bus:
#Chicago/Austin Terminal, 41.899321, -87.774772
#215 E Chicago Ave (first eastern stop on Chicago), 41.896650, -87.620773
bus66_full <- chicago_data %>%
filter(street == "W Chicago Ave") %>%
filter(type != "JAM" & type != "ACCIDENT") %>%
arrange(location_x)
bus66_full <- bus66_full %>%
filter(location_x >= -87.774772 & location_x <= -87.620773)
bus66_79 <- rbind(bus66_full, bus79_full)
incidents <- table(bus66_79$subtype)
incidents
##
## HAZARD_ON_ROAD
## 213 6
## HAZARD_ON_ROAD_CAR_STOPPED HAZARD_ON_ROAD_CONSTRUCTION
## 404 2074
## HAZARD_ON_ROAD_ICE HAZARD_ON_ROAD_OBJECT
## 6 116
## HAZARD_ON_ROAD_POT_HOLE HAZARD_ON_ROAD_ROAD_KILL
## 25409 7
## HAZARD_ON_ROAD_TRAFFIC_LIGHT_FAULT HAZARD_ON_SHOULDER_ANIMALS
## 81 6
## HAZARD_ON_SHOULDER_CAR_STOPPED HAZARD_ON_SHOULDER_MISSING_SIGN
## 142 6
## HAZARD_WEATHER HAZARD_WEATHER_FLOOD
## 12 12
## HAZARD_WEATHER_FOG ROAD_CLOSED_CONSTRUCTION
## 6 89
## ROAD_CLOSED_EVENT
## 36381
A: The two most prevelant subtypes of issues are pot holes and road closures, so while the second can’t be fixed presumably (assuming it’s construction or something necessary), the pot holes would certainly slow traffic and could be fixed by road resurfacing.
Revisit the event which caused c5a73cc6-5242-3172-be5a-cf8990d70cb2.
chicago_data %>%
filter(uuid == "c5a73cc6-5242-3172-be5a-cf8990d70cb2")
## # A tibble: 4 x 17
## country nTHumbsUp city reportRating confidence reliability type uuid
## <chr> <int> <chr> <int> <int> <int> <chr> <chr>
## 1 US 0 Chic… 4 0 5 JAM c5a7…
## 2 US 0 Chic… 4 0 5 JAM c5a7…
## 3 US 0 Chic… 4 0 5 JAM c5a7…
## 4 US 0 Chic… 4 0 5 JAM c5a7…
## # … with 9 more variables: roadType <int>, magvar <int>, subtype <chr>,
## # street <chr>, location_x <dbl>, location_y <dbl>, pubMillis <dbl>,
## # reportDescription <chr>, scrape_dt <chr>
slakeshore <-
chicago_data %>%
filter(street == "S Lake Shore Dr")
slakeshore <- slakeshore %>%
mutate(date = as.Date(scrape_dt)) %>%
mutate(scrape_dt = ymd_hms(scrape_dt)) %>%
mutate(scrape_dt_hour = hour(with_tz(scrape_dt, tzone = "US/Central"))) %>%
filter(date == "2017-12-24")
## Warning: 1 failed to parse.
(bounding_box <- slakeshore %>%
filter(location_x >= -87.71875 & location_x <= -87.51875) %>%
filter(location_y >= 41.76272 & location_y <= 41.96272) %>%
select(contains("scrape_dt"), location_x, location_y, type, uuid, everything()))
## # A tibble: 326 x 19
## scrape_dt scrape_dt_hour location_x location_y type uuid
## <dttm> <int> <dbl> <dbl> <chr> <chr>
## 1 2017-12-24 00:10:00 18 -87.6 41.9 JAM 684d…
## 2 2017-12-24 00:15:00 18 -87.6 41.9 JAM 684d…
## 3 2017-12-24 00:20:00 18 -87.6 41.9 JAM 684d…
## 4 2017-12-24 00:25:00 18 -87.6 41.9 JAM 684d…
## 5 2017-12-24 00:30:00 18 -87.6 41.9 JAM 684d…
## 6 2017-12-24 00:35:00 18 -87.6 41.9 JAM 684d…
## 7 2017-12-24 00:40:00 18 -87.6 41.9 JAM 684d…
## 8 2017-12-24 00:45:00 18 -87.6 41.9 JAM 684d…
## 9 2017-12-24 00:55:00 18 -87.6 41.8 WEAT… af9e…
## 10 2017-12-24 01:00:00 19 -87.6 41.8 WEAT… af9e…
## # … with 316 more rows, and 13 more variables: country <chr>,
## # nTHumbsUp <int>, city <chr>, reportRating <int>, confidence <int>,
## # reliability <int>, roadType <int>, magvar <int>, subtype <chr>,
## # street <chr>, pubMillis <dbl>, reportDescription <chr>, date <date>
b. What causes all these jams? Some googling might help.
A: Weather cause all these jams. If you google, snowfall caused jam on 2017/12/24.
c. Plot the number of jams 6AM-6PM CST. Why are there two humps?
A: The humidity was the highest(92%) at 11:16am, which means that it snowed the most at 11am time bin. Also, the humidity was second highest (89%) around 2:53~3:53pm. On top of that, there was football game that Chicago Bears won on that day, although it might had very minor effect on the jams, due to the weather.
d. Place one vertical line at each hump.
bounding_box %>%
filter(scrape_dt_hour >= 6 & scrape_dt_hour <= 18)%>%
ggplot(aes(x = scrape_dt_hour)) +
geom_bar() +
labs(title = "Number of jams (2017/12/24 6AM-6PM CST)", x = "Hours") +
scale_x_continuous(breaks = seq(6, 18, by = 1)) +
geom_vline(xintercept=11, color = "red") +
geom_vline(xintercept=15, color = "red")
e. Next, propose a quantitative measure of traffic jam severity that combines
the number of traffic `JAM` alerts with information in the `subtype` variable.
A: I would like to propose fraction of ‘JAM’ subtypes to measure the overall jam severity. The proposed fraction is calculated by [# of subtype / (total # of JAM)]*100
measure <-
bounding_box %>%
filter(type == "JAM") %>%
group_by(type, subtype) %>%
count() %>%
arrange(desc(n))
measure %>%
mutate(fraction = (n / (62+37+16+15))*100)
## # A tibble: 4 x 4
## # Groups: type, subtype [4]
## type subtype n fraction
## <chr> <chr> <int> <dbl>
## 1 JAM JAM_STAND_STILL_TRAFFIC 62 47.7
## 2 JAM JAM_HEAVY_TRAFFIC 37 28.5
## 3 JAM "" 16 12.3
## 4 JAM JAM_MODERATE_TRAFFIC 15 11.5
f. Plot this measure from 6AM-6PM CST. Is there any information that is
conveyed by your severity measure that was not captured by plotting the number
of jams? If so, what is it?
A: Yes. By plotting this measure, I could detect the severity, most frequent type of JAM on 2017/12/24 from 6AM-6PM CST, because based on the subtype variable, I can classify jams. JAM_STAND_STILL_TRAFFIC > JAM_HEAVY_TRAFFIC > JAM(without subtype) > JAM_MODERATE_TRAFFIC
bounding_box %>%
filter(type == "JAM") %>%
filter(scrape_dt_hour >= 6 & scrape_dt_hour <= 18) %>%
group_by(type, subtype) %>%
count() %>%
arrange(desc(n)) %>%
mutate(fraction = (n / (62+37+16+15))*100) %>%
ggplot(aes(x = subtype, y = fraction)) +
geom_bar(stat="identity") +
coord_flip()
A: uuid for major accident is 637d46be-9830-39a4-b6e7-d5ccee2c806f. Location for x is -87.7011, y is 41.93575. Time the accident first appeared in the data is 2018-03-20 00:00:00. Sample alerts are in the time range 2018-03-19 22:00:00 ~ 2018-03-20 02:00:00
# uuid for major accident is 637d46be-9830-39a4-b6e7-d5ccee2c806f
chicago_data %>%
filter(subtype == "ACCIDENT_MAJOR") %>%
group_by(uuid) %>%
count() %>%
arrange(desc(n))
## # A tibble: 4,019 x 2
## # Groups: uuid [4,019]
## uuid n
## <chr> <int>
## 1 637d46be-9830-39a4-b6e7-d5ccee2c806f 101
## 2 fa8701d5-e01d-3c83-a6d1-1cbfc738fb64 100
## 3 7974cb78-65c1-339f-9f66-8db34be23ae9 29
## 4 829a5c3d-2acf-3b33-9873-c3d225b391f3 28
## 5 d56af23f-254e-34b4-9fa7-a46cb93f6dcd 26
## 6 840de6ec-b83c-30b9-9148-8c01b66ce1fc 24
## 7 98328a83-1df7-369b-9cc9-b15e0a87f5b8 20
## 8 c37299dc-14e8-32f2-9f93-071dba9e77c8 20
## 9 e259a73a-8158-37af-9e1f-a72f74083fc3 19
## 10 3b7f0ce9-901b-31ea-baf6-ce6c5a553fac 18
## # … with 4,009 more rows
# Time the accident first appeared in the data is 2018-03-20 00:00:00
chicago_data %>%
filter(uuid == "637d46be-9830-39a4-b6e7-d5ccee2c806f") %>%
mutate(time = substr(scrape_dt, 11, 19)) %>%
group_by(time) %>%
arrange(time) %>%
select(scrape_dt, time, uuid, everything())
## # A tibble: 101 x 18
## # Groups: time [101]
## scrape_dt time uuid country nTHumbsUp city reportRating confidence
## <chr> <chr> <chr> <chr> <int> <chr> <int> <int>
## 1 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 2 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 3 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 4 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 5 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 6 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 7 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 8 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 9 2018-03-… " 00… 637d… US 0 Chic… 2 2
## 10 2018-03-… " 00… 637d… US 0 Chic… 2 2
## # … with 91 more rows, and 10 more variables: reliability <int>,
## # type <chr>, roadType <int>, magvar <int>, subtype <chr>, street <chr>,
## # location_x <dbl>, location_y <dbl>, pubMillis <dbl>,
## # reportDescription <chr>
# Sample alerts with 2 hour before the first accident
library(hms)
##
## Attaching package: 'hms'
## The following object is masked from 'package:lubridate':
##
## hms
before_accd <-
chicago_data %>%
filter(location_x >= -87.7211 & location_x <= -87.6811) %>%
filter(location_y >= 41.91575 & location_y <= 41.95575) %>%
mutate(date = as.Date(scrape_dt)) %>%
mutate(time = substr(scrape_dt, 11, 19)) %>%
mutate(time = as.hms(time)) %>%
filter(date == "2018-03-19") %>%
filter(time >= as.hms("22:00:00"))
## Warning: as.hms() is deprecated, please use as_hms().
## This warning is displayed once per session.
# Sample alerts with 2 hour after the first accident
after_accd <-
chicago_data %>%
filter(location_x >= -87.7211 & location_x <= -87.6811) %>%
filter(location_y >= 41.91575 & location_y <= 41.95575) %>%
mutate(date = as.Date(scrape_dt)) %>%
mutate(time = substr(scrape_dt, 11, 19)) %>%
mutate(time = as.hms(time)) %>%
filter(date == "2018-03-20") %>%
filter(time <= as.hms("02:00:00"))
# Sample alerts by binding 2 dataset(before & after the accident)
alert_sample <- rbind(before_accd, after_accd)
# Plot with 5 min interval
alert_sample %>%
select(scrape_dt, date, time, subtype, everything()) %>%
mutate(dt = ymd_hms(scrape_dt)) %>%
group_by(dt) %>%
count() %>%
ggplot(aes(x = dt, y = n)) +
geom_line() +
labs(title = "# of traffic jam alerts before, after the accident that happened in 18-03-20 00:00:00",
x = "5 min interval for 2 hr before and after the accident",
y = "number of traffic jam") +
scale_x_datetime(date_breaks = "5 min", date_labels = "%H:%M") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
uuid, a date-time, a latitude and a longitude and returns a data frame with the number of alerts in each five-minute interval from two hours before to two hours after.# Function that takes a data frame of alerts, uuid.
get_accident1 <-
function(uuid) {
select(uuid, scrape_dt) %>%
mutate(dt = ymd_hms(scrape_dt)) %>%
group_by(dt) %>%
count()
}
get_accident2 <-
function(uuid, scrape_dt) {
select(uuid, scrape_dt) %>%
mutate(dt = ymd_hms(scrape_dt)) %>%
count()
}
# filter(dt < dt + 2*60*60 & dt > dt - 2*60*60)
alert_sample %>%
get_accident1()
## # A tibble: 49 x 2
## # Groups: dt [49]
## dt n
## <dttm> <int>
## 1 2018-03-19 22:00:00 18
## 2 2018-03-19 22:05:00 15
## 3 2018-03-19 22:10:00 18
## 4 2018-03-19 22:15:00 17
## 5 2018-03-19 22:20:00 18
## 6 2018-03-19 22:25:00 18
## 7 2018-03-19 22:30:00 18
## 8 2018-03-19 22:35:00 18
## 9 2018-03-19 22:40:00 18
## 10 2018-03-19 22:45:00 18
## # … with 39 more rows
# data frame with every major accident on 2017-11-20
nov_20_major_accd <-
chicago_data %>%
filter(subtype == "ACCIDENT_MAJOR") %>%
mutate(date = as.Date(scrape_dt)) %>%
filter(date == "2017-11-20") %>%
select(uuid, scrape_dt, everything())
# Feed each row of this data frame to your function
for (i in nov_20_major_accd$uuid) {
nov_20_major_accd %>%
get_accident2(i)
}
# Collapse the output into the mean number of traffic jam alerts
nov_20_major_accd %>%
get_accident1() %>%
group_by(n) %>%
count() %>%
mutate(average_jam_alert = (1*61+2*15+3*2+4*4) / (61+15+2+4))
## # A tibble: 4 x 3
## # Groups: n [4]
## n nn average_jam_alert
## <int> <int> <dbl>
## 1 1 61 1.38
## 2 2 15 1.38
## 3 3 2 1.38
## 4 4 4 1.38
chicago_data %>%
filter(subtype == "ACCIDENT_MAJOR") %>%
mutate(date = as.Date(scrape_dt)) %>%
mutate(time = substr(scrape_dt, 11, 19)) %>%
mutate(time = as.hms(time)) %>%
select(scrape_dt, date, time, subtype, everything()) %>%
mutate(dt = ymd_hms(scrape_dt)) %>%
group_by(dt) %>%
count() %>%
ggplot(aes(x = dt, y = n)) +
geom_line() + labs(title = "mean number of jam alerts around major accident")
## Warning: 21 failed to parse.
## Warning: Removed 1 rows containing missing values (geom_path).